This report contains the result of the analysis performed on the dataset GSE163165. This dataset derived
from the publication:“The response of two polar monocyte subsets to inflammation” authored by Vishnyakova et al, in July 2021 (https://www.sciencedirect.com/science/article/pii/S0753332221003991?via%3Dihub#sec0135). The transcriptomic profiling was performed to analyze the response of CD14+ macrophages derived from monocytes to an inflammatory stimuli induced with LPS.The experimental design consisted in two groups a control group of non stimulated CD14+ macrophages and the LPS group constituted by CD14+ macrophages supplemented with lipopolisacarides to induce innmune activation.Each group was conformed by three samples. For this analysis, the parameters and methodologies presented by the authors in the methods of the paper have been recreated.However some methodological details are omitted on the paper, on those I have used my own criteria to decided the tools and parameters to use.
First we have to collect all the necessary information for the analysis from GEO.
# load counts table from GEO
urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path <- paste(urld, "acc=GSE163165", "file=GSE163165_raw_counts_GRCh38.p13_NCBI.tsv.gz", sep="&");
GSE163165_count <- as.matrix(data.table::fread(path, header=T, colClasses="integer"), rownames=1)
nrow(GSE163165_count) #39376 genes
## [1] 39376
# load the metadata information
library(GEOquery)
## Loading required package: Biobase
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
## https://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html
gds <- getGEO("GSE163165")
## Found 1 file(s)
## GSE163165_series_matrix.txt.gz
Meta_GSE163165 <- pData(gds$GSE163165_series_matrix.txt.gz@phenoData)
#keep only relevant information for metadata
Meta_GSE163165 <- Meta_GSE163165[,c("title","source_name_ch1","characteristics_ch1","characteristics_ch1.1","cell type:ch1","gender:ch1")]
head(Meta_GSE163165, n=3)
## title source_name_ch1
## GSM4973752 CD14-derived macrophages: control 1 blood
## GSM4973753 CD14-derived macrophages: LPS 1 blood
## GSM4973754 CD14-derived macrophages: control 2 blood
## characteristics_ch1 characteristics_ch1.1
## GSM4973752 gender: female cell type: CD14-derived macrophages
## GSM4973753 gender: female cell type: CD14-derived macrophages
## GSM4973754 gender: female cell type: CD14-derived macrophages
## cell type:ch1 gender:ch1
## GSM4973752 CD14-derived macrophages female
## GSM4973753 CD14-derived macrophages female
## GSM4973754 CD14-derived macrophages female
For some reason the metadata does not contains a column to discriminate between the two groups: treated and untreated cells so I created it based on the information on the column “title”
Meta_GSE163165$treatment <- c("Control", "LPS", "Control", "LPS", "Control", "LPS")
#I will use this new column as factor
Factors_GSE163165 <- Meta_GSE163165[,c("treatment")]
Now, let’s Collect biological information to run NOISeq QC later. NOISeq require as an input the counts, a factor (the variable of interest in our analysis/comparison) and some biological information: such as gene size, localization in chromosome, gene byotype and GC content.
# Write the gene ids in a txt file.
#write.table(rownames(GSE163165_count),"./gene_names.entrez.txt",
#col.names = FALSE,row.names = FALSE,quote=F)
#Upload the annotations obtained from BioMart
# Import the information
annotgene <- read.csv("./mart_export_GSE163165.txt",sep="\t",header = T)
# How many genes do I get annotated?
sum(rownames(GSE163165_count) %in% annotgene$NCBI.gene..formerly.Entrezgene..ID) #25898
## [1] 25898
# Filter the annotation to keep only cannonical genes
annotgene <- annotgene[annotgene$Chromosome.scaffold.name %in% c(as.character(1:22) ,"X","Y"),]
nrow(annotgene) #26105
## [1] 26105
sum(rownames(GSE163165_count) %in% annotgene$NCBI.gene..formerly.Entrezgene..ID) #25872
## [1] 25872
# Filter annotation to remove duplicated genes
annotgene_filt <- annotgene[!duplicated(annotgene$NCBI.gene..formerly.Entrezgene..ID),]
#Verify all size match ok
sum(rownames(GSE163165_count) %in% annotgene$NCBI.gene..formerly.Entrezgene..ID) #25872
## [1] 25872
sum(annotgene_filt$NCBI.gene..formerly.Entrezgene..ID %in% rownames(GSE163165_count)) #25872
## [1] 25872
#Now I can do the assignation of gene names as the rownames
## Overlap between annotation and genes
rownames(annotgene_filt) <- as.character(annotgene_filt$NCBI.gene..formerly.Entrezgene..ID)
sum(as.character(rownames(annotgene_filt)) %in% rownames(GSE163165_count)) #25872 ok!
## [1] 25872
# For NOISeq QC Work only with the annotated genes! lets filter the counts
GSE163165_count_filt <- GSE163165_count[rownames(GSE163165_count) %in% rownames(annotgene_filt),]
#keep the excluded just in case
GSE163165_count_exc <-GSE163165_count[!(rownames(GSE163165_count) %in% rownames(annotgene_filt)),]
#order to keep same order of samples and genes always
annotgene_ord <- annotgene_filt[rownames(GSE163165_count_filt ),]
#checking once more all the sizes fit
sum(rownames(annotgene_ord)==rownames(GSE163165_count_filt)) #25872 ok!
## [1] 25872
NOISeq is a R package for quality control of count data, it allow us to identify and control outliers or weather our data has any bias. With NOISeq we can monitor the quality of our data, take better decisions to improve our analysis, and be aware about the features of our data that can bias the interpretation of our results.
library(NOISeq)
## Loading required package: splines
## Loading required package: Matrix
Factors_GSE163165 <- data.frame(Meta_GSE163165 [ colnames(GSE163165_count_filt),c("treatment")])
colnames(Factors_GSE163165)[1]<- "Group"
#additional biological data in the correct format
lengthuse <- abs(annotgene_ord$Gene.end..bp.-annotgene_ord$Gene.start..bp.)
names(lengthuse) <- rownames(annotgene_ord)
head(lengthuse, n=3)
## 100287102 102466751 100302278
## 2540 67 137
gc <- annotgene_ord$Gene...GC.content
names(gc) <- rownames(annotgene_ord)
head(gc, n=3)
## 100287102 102466751 100302278
## 57.50 61.76 31.16
biotype <-annotgene_ord$Gene.type
names(biotype) <- rownames(annotgene_ord)
head(biotype, n=3)
## 100287102 102466751 100302278
## "lncRNA" "miRNA" "miRNA"
chromosome <- annotgene_ord[,c("Chromosome.scaffold.name","Gene.start..bp.","Gene.end..bp.")]
head(chromosome, n=3)
## Chromosome.scaffold.name Gene.start..bp. Gene.end..bp.
## 100287102 1 11869 14409
## 102466751 1 17369 17436
## 100302278 1 30366 30503
#but I don't like the extended names of the columns, so lets rename it.
colnames(chromosome) <- c("Chromosome","start","end")
head(chromosome, n=3)
## Chromosome start end
## 100287102 1 11869 14409
## 102466751 1 17369 17436
## 100302278 1 30366 30503
data_NOISEQ <- readData(data = GSE163165_count_filt,
length=lengthuse,
gc=gc,
biotype= biotype ,
chromosome = chromosome ,
factors = Factors_GSE163165)
Byotype detection:plot the abundance of the different biotypes (e.g protein coding, lnc-RNA etc) in the genome with % of genes detected in the sample/condition and within the sample/condition.
## Biotypes detection is to be computed for:
## [1] "GSM4973752" "GSM4973753" "GSM4973754" "GSM4973755" "GSM4973756"
## [6] "GSM4973757"
Saturation:Represent the number of detected genes (counts>0) per sample across different sequencing depths.
Length bias:Mean gene expression per each length bin. A fitted curve and diagnostic test are also produced.
## [1] "Warning: 4920 features with 0 counts in all samples are to be removed for this analysis."
## [1] "Length bias detection information is to be computed for:"
## [1] "Control" "LPS"
## [1] "Control"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.341 -4.286 -0.041 3.892 31.588
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.865e+00 6.356e+00 1.080 0.28296
## bx1 -6.467e-01 6.550e+00 -0.099 0.92156
## bx2 4.158e+01 7.001e+00 5.940 5.03e-08 ***
## bx3 2.225e+01 7.768e+00 2.865 0.00517 **
## bx4 4.347e+01 8.338e+00 5.214 1.13e-06 ***
## bx5 2.643e+01 1.037e+01 2.549 0.01246 *
## bx6 3.848e+01 1.685e+01 2.284 0.02470 *
## bx7 -5.504e+00 5.104e+01 -0.108 0.91436
## bx8 2.503e+02 4.230e+02 0.592 0.55552
## bx9 -3.635e+03 6.416e+03 -0.566 0.57246
## bx10 2.543e+04 4.464e+04 0.570 0.57024
## bx11 -1.433e+05 2.516e+05 -0.569 0.57044
## bx12 NA NA NA NA
## bx13 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.357 on 92 degrees of freedom
## Multiple R-squared: 0.7893, Adjusted R-squared: 0.7641
## F-statistic: 31.34 on 11 and 92 DF, p-value: < 2.2e-16
##
## [1] "LPS"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.449 -3.819 0.000 3.399 31.816
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.753e+00 6.313e+00 0.753 0.45348
## bx1 8.020e-01 6.505e+00 0.123 0.90215
## bx2 4.252e+01 6.954e+00 6.115 2.31e-08 ***
## bx3 2.551e+01 7.715e+00 3.307 0.00135 **
## bx4 4.247e+01 8.281e+00 5.129 1.61e-06 ***
## bx5 2.601e+01 1.030e+01 2.526 0.01326 *
## bx6 4.344e+01 1.674e+01 2.596 0.01098 *
## bx7 -2.416e+01 5.069e+01 -0.477 0.63477
## bx8 3.889e+02 4.202e+02 0.926 0.35712
## bx9 -5.773e+03 6.373e+03 -0.906 0.36736
## bx10 4.030e+04 4.434e+04 0.909 0.36569
## bx11 -2.271e+05 2.499e+05 -0.909 0.36582
## bx12 NA NA NA NA
## bx13 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.313 on 92 degrees of freedom
## Multiple R-squared: 0.7921, Adjusted R-squared: 0.7672
## F-statistic: 31.86 on 11 and 92 DF, p-value: < 2.2e-16
GC Bias: Mean gene expression per each GC content bin.A fitted curve and diagnostic test are also produced.
## [1] "Warning: 4920 features with 0 counts in all samples are to be removed for this analysis."
## [1] "GC content bias detection is to be computed for:"
## [1] "Control" "LPS"
## [1] "Control"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2002 -3.4960 -0.1579 2.6908 13.1433
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.528 5.082 0.301 0.76431
## bx1 7.554 7.187 1.051 0.29598
## bx2 -95.601 100.158 -0.954 0.34236
## bx3 26.191 13.159 1.990 0.04955 *
## bx4 30.641 6.599 4.643 1.15e-05 ***
## bx5 31.020 5.899 5.258 9.56e-07 ***
## bx6 29.512 5.868 5.029 2.46e-06 ***
## bx7 25.608 5.941 4.311 4.11e-05 ***
## bx8 35.197 6.165 5.709 1.41e-07 ***
## bx9 12.516 6.828 1.833 0.07006 .
## bx10 25.331 9.013 2.810 0.00606 **
## bx11 -24.225 33.588 -0.721 0.47261
## bx12 145.233 252.866 0.574 0.56715
## bx13 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.082 on 91 degrees of freedom
## Multiple R-squared: 0.5726, Adjusted R-squared: 0.5162
## F-statistic: 10.16 on 12 and 91 DF, p-value: 2.387e-12
##
## [1] "LPS"
##
## Call:
## lm(formula = datos[, i] ~ bx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9529 -3.4501 -0.4583 3.2123 16.6730
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.396 5.188 0.269 0.78853
## bx1 6.565 7.336 0.895 0.37321
## bx2 -130.125 102.244 -1.273 0.20637
## bx3 29.490 13.433 2.195 0.03069 *
## bx4 30.638 6.736 4.548 1.66e-05 ***
## bx5 30.659 6.022 5.091 1.91e-06 ***
## bx6 27.914 5.990 4.660 1.08e-05 ***
## bx7 24.115 6.064 3.977 0.00014 ***
## bx8 32.832 6.294 5.217 1.14e-06 ***
## bx9 10.986 6.970 1.576 0.11847
## bx10 24.708 9.201 2.685 0.00861 **
## bx11 -35.043 34.287 -1.022 0.30946
## bx12 239.832 258.131 0.929 0.35529
## bx13 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.188 on 91 degrees of freedom
## Multiple R-squared: 0.5821, Adjusted R-squared: 0.527
## F-statistic: 10.56 on 12 and 91 DF, p-value: 9.271e-13
Exploratory PCA: Principal component analysis score plots for PC1 vs PC2
I am trying to reproduce the analysis performed by the authors in the original paper described mainly in the supplementary material.For some aspects they don’t specify their criteria so I have made my own decisions.
#DEA with DESeq2 since was the tool used in the publication
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:Matrix':
##
## expand, unname
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
#creating object
GSE163165_DESeq2 <- DESeqDataSetFromMatrix(countData = GSE163165_count_filt,
colData = pData(data_NOISEQ),
design = ~ Group)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
#run DE analysis
GSE163165_DESeq2 <- DESeq(GSE163165_DESeq2)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
In the paper the authors does not specify weather they apply or not, any kind of filtering to the originalcounts to remove low express genes. So I have decided to filter the data by using the HTSfilter package from R.The HTSfilter performs a filtering based on a global jaccard similarity index in order to identify genes with low, constant levels of expression across one or more experimental conditions.
library(HTSFilter)
#S.len=number of tested threshold, default is 100, I put 50 to reduce computational time
filter <- HTSFilter(GSE163165_DESeq2, s.len=50, plot=FALSE)$filteredData
class(filter)
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
dim(filter)
## [1] 14396 6
Important DESeq2 implements an independent filtering procedure by default in the results function; this filter is another alternative and does not need to be used in addition to the one included in HTSFilter.So to make use of HTSFilter within the DESeq2 pipeline, the argument independentFiltering=FALSE is used when calling the results function in DESeq2.
res <- results(filter, independentFiltering=FALSE)
summary(res)
##
## out of 14396 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 3575, 25%
## LFC < 0 (down) : 3727, 26%
## outliers [1] : 73, 0.51%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Now let’s start exploring our DEGs
#Let's plot the distribution of our genes
plotMA(res, ylim=c(-2,2))
I like how my MA plot looks, I would even consider not to shrink the LogFC since the error in FC calculation seems to be uniform (the distribution of the dots is more or less uniform both with low or high number of counts).The HTSfilter helped me to control the possible false positives well. However, I applied the HTSfilter since its use is described in the methods of the paper.
res_shrink <- lfcShrink(filter, coef = c('Group_LPS_vs_Control'))
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
plotMA(res_shrink, ylim=c(-2,2))
As I expected, I don’t see major differences in the plot. The error on LFC was properly controlled. now, lets explore a bit more the data. Filtering DEGs according to the thresholds reported in the paper: log2FC > 2( meaning a FC of 4) and P-Value < 10e-2
degs = as.data.frame(res_shrink)[which(res_shrink$pvalue <= 0.1 & abs(res_shrink$log2FoldChange)>2),]
dim(degs)
## [1] 1118 5
head(degs,n=3)
## baseMean log2FoldChange lfcSE pvalue padj
## 84069 387.49301 2.878192 0.5012330 7.351238e-10 1.114199e-08
## 84808 59.60164 2.754447 0.6597518 2.338888e-06 1.707436e-05
## 9636 16148.84676 6.573670 0.6161060 8.123671e-28 1.118801e-25
I have obtained substantially less genes than the original author, probably because, since I applied the HTSfilter and then the function for shirink my criteria was very strigent. Now, lets explore if my results are still representative of the biological context. As a next step. The table 1 on the paper show the top 30 up and downregulated genes in LPS vs Controls, lets check if we have the same genes as top DEGs.
#Extracting the top genes.
topdown <- degs[order(degs$log2FoldChange)[1:30],]
head(topdown, n=3)
## baseMean log2FoldChange lfcSE pvalue padj
## 83876 184.13947 -6.874763 1.448829 8.959498e-08 9.037105e-07
## 4605 282.57332 -6.634769 1.410811 9.378244e-08 9.426287e-07
## 140462 11.67954 -6.385633 3.392183 5.951266e-04 2.245521e-03
topup <- degs[order(degs$log2FoldChange, decreasing = TRUE)[1:30],]
topup
## baseMean log2FoldChange lfcSE pvalue padj
## 8710 641.32949 14.158703 3.3098011 1.937340e-21 1.468176e-19
## 3593 227.97986 12.506382 3.1624510 2.108384e-17 1.067081e-15
## 79370 196.91156 12.223771 3.1888746 5.944122e-15 2.107368e-13
## 3620 40777.72290 10.831645 0.6128512 3.140581e-71 3.460196e-68
## 4498 77.46825 10.776572 2.9574174 9.680486e-15 3.262438e-13
## 1440 345.40450 10.728309 1.3909546 6.276311e-18 3.341844e-16
## 3559 13120.45193 10.714874 0.4678675 3.938284e-117 1.410201e-113
## 101929800 638.94088 10.670261 1.3486957 1.204149e-15 4.725211e-14
## 440896 2068.11968 9.818548 0.9273288 1.918991e-27 2.411026e-25
## 6355 46261.81920 9.630847 0.2755659 3.211286e-269 4.599526e-265
## 730249 804.51197 9.611626 1.8703979 8.893149e-09 1.112459e-07
## 10563 527.21467 9.547816 0.9077753 3.547794e-26 3.969926e-24
## 7424 33.50778 9.323975 2.8595864 1.634170e-10 2.816632e-09
## 27074 10936.11125 9.114400 0.9645558 1.627552e-22 1.347481e-20
## 169355 123.10210 9.097070 1.4988580 8.931487e-12 1.909339e-10
## 4489 188.86630 8.980125 1.0755073 2.344335e-16 1.017512e-14
## 3036 26.64492 8.955254 2.7986401 2.897458e-10 4.775638e-09
## 6318 26.97442 8.802759 2.9454294 7.104695e-08 7.299896e-07
## 91543 44620.05332 8.587447 0.9381332 2.769459e-21 2.055283e-19
## 162517 21.46942 8.566139 2.7735824 2.522473e-09 3.511116e-08
## 6363 463.01051 8.496142 1.0918895 6.051309e-16 2.483464e-14
## 10964 15418.64993 8.489990 0.7053236 1.227201e-34 2.547419e-32
## 114897 1699.06567 8.445998 0.5371146 9.022946e-57 5.384819e-54
## 6364 1387.10838 8.372299 0.4481551 9.374036e-79 1.678304e-75
## 7139 18.55670 8.315635 2.7457241 6.087494e-09 7.826856e-08
## 7130 17428.14703 8.262553 0.6442186 6.685316e-39 1.595896e-36
## 3898 1503.09584 8.191451 0.9811922 2.911803e-18 1.616502e-16
## 400759 355.56007 8.162025 0.8790624 1.828013e-21 1.392693e-19
## 5157 159.92285 8.053848 1.2193022 5.823009e-12 1.285100e-10
## 2919 22338.97582 7.933114 0.4762366 1.679919e-63 1.604098e-60
head(topup,n=3)
## baseMean log2FoldChange lfcSE pvalue padj
## 8710 641.3295 14.15870 3.309801 1.937340e-21 1.468176e-19
## 3593 227.9799 12.50638 3.162451 2.108384e-17 1.067081e-15
## 79370 196.9116 12.22377 3.188875 5.944122e-15 2.107368e-13
#Now lets extract the entrez id to convert it to gene name with Biomart to know if the identity of our top genes
ids_down <- rownames(topdown)
ids_up <- rownames(topup)
# Write the gene ids in a txt file.
#write.table(ids_down,"./ids_top_up.txt",
#col.names = FALSE,row.names = FALSE,quote=F)
#write.table(ids_up,"./ids_top_down.txt",
#col.names = FALSE,row.names = FALSE,quote=F)
# Upload the annotations from Biomart
namesup <- read.csv("./mart_export_top_up.txt",sep="\t",header = T)
namesdown <- read.csv("./mart_export_top_down.txt",sep="\t",header = T)
#fixing rownames for compatibility
rownames(namesup) <-namesup$NCBI.gene..formerly.Entrezgene..ID
namesup$NCBI.gene..formerly.Entrezgene..ID = NULL
rownames(namesdown) <-namesdown$NCBI.gene..formerly.Entrezgene..ID
namesdown$NCBI.gene..formerly.Entrezgene..ID =NULL
#merging my top DEGs with the annotation from Biomart
top_degs_up <-merge(topup,namesup,by='row.names', all=TRUE)
head(top_degs_up, n=3)
## Row.names baseMean log2FoldChange lfcSE pvalue padj
## 1 101929800 638.9409 10.670261 1.3486957 1.204149e-15 4.725211e-14
## 2 10563 527.2147 9.547816 0.9077753 3.547794e-26 3.969926e-24
## 3 10964 15418.6499 8.489990 0.7053236 1.227201e-34 2.547419e-32
## Gene.name
## 1 LINC03025
## 2 CXCL13
## 3 IFI44L
top_degs_down <-merge(topdown,namesdown,by='row.names',all=TRUE)
head(top_degs_down, n=3)
## Row.names baseMean log2FoldChange lfcSE pvalue padj
## 1 100129792 936.04789 -4.768061 0.9341559 1.480924e-08 1.770557e-07
## 2 101928200 19.65565 -6.053793 1.4696963 1.467513e-06 1.120353e-05
## 3 116372 84.40922 -5.484964 1.0521385 1.068566e-08 1.318635e-07
## Gene.name
## 1 CCDC152
## 2 MRGPRF-AS1
## 3 LYPD1
With the information we have collected to this point lets make some venn diagrams to check how many of our top genes are also reported in the paper
library(VennDiagram)
## Loading required package: grid
## Loading required package: futile.logger
library(RColorBrewer)
#vectors with the list of top genes from the paper
top_paper_up <-c("SERPINB7", "IL12B", "BCL2L14", "LOC101929800", "BX255923.3", "IDO1", "IL2RA",
"CSF3", "MT1JP", "CCL8", "IRG1", "CXCL13", "LOC102723642", "VEGFC", "IFITM1",
"IDO2", "LAMP3", "HAS1", "SERPINB4", "FBXO39", "RSAD2", "IFI44L", "CCL19",
"C1QTNF1", "LOC101929319", "GBP1P1", "CCL20", "TNFAIP6", "TNNT2", "UPB1")
top_paper_down <- c("C3orf22", "LYPD1", "ASB9", "MRO", "MYBL2", "SPC25", "TPD52L1", "TMEM37",
"ASIC", "MYZAP", "APLN", "PPAPDC1A", "SKA3", "SAMD13", "PNPLA3", "MYLK2",
"CMBL", "RRM2", "MCM10", "FAM111B", "CCSER1", "PGBD5", "TLR5", "GAL3ST4",
"FFAR4", "LOC102724574", "GABBR2", "DNASE2B", "PBK", "HJURP")
# Chart up
vup <- venn.diagram(
x = list(top_paper_up,namesup$Gene.name),
category.names = c("Top30 Up paper", "Top 30 Up me"),
filename = 'venn_diagram_up_degs.png',
output=TRUE,
# Circles
lwd = 2,
lty = 'blank',
fill = c("blue","yellow"),
# Numbers
cex = 1,
fontface = "bold",
fontfamily = "sans",
)
#chart down
vdown <- venn.diagram(
x = list(top_paper_down,namesdown$Gene.name),
category.names = c("Top30 Up paper", "Top 30 Up me"),
filename = 'venn_diagram_down_degs.png',
output=TRUE,
# Circles
lwd = 2,
lty = 'blank',
fill = c("blue","yellow"),
# Numbers
cex = 1,
fontface = "bold",
fontfamily = "sans",
)
As we can observe from previous plots, the identity of the top genes reported on the paper with the genes identified with this analysis is highly similar.Then, even if my filtering was strict I could capture the most relevant genes.
Figure 4.D in the manuscript present a volcano plot with the overall results of the DE analysis on LPS estimulated monocites and controls. I recreate the figure for comparisons. Now my figure looks like
#lets do some modification to our results to facilitate the plotting
#Adding relevant information to the DEGs, like: direction, hgcn symbol
#I fixed my problem with BiomaRt following this thread (https://stackoverflow.com/questions/77370659/error-failed-to-collect-lazy-table-caused-by-error-in-db-collect-using)
library(biomaRt)
ensembl=useMart(biomart = "ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl")
databaseGenesEnsembl <- getBM(
attributes=c("entrezgene_id", "hgnc_symbol"),
values=rownames(res_shrink),
filters="entrezgene_id",
mart=ensembl
)
#results
full_results=merge(as.data.frame(res_shrink), databaseGenesEnsembl,by.x="row.names", by.y="entrezgene_id", all.x=TRUE)
#let's create a column to indicate weather a gene is deg or not
full_results$DE="N.S."
full_results$DE[which(full_results$Row.names %in% rownames(degs)[which(degs$log2FoldChange>0)])]="UP"
full_results$DE[which(full_results$Row.names %in% rownames(degs)[which(degs$log2FoldChange<0)])]="DOWN"
full_results$condlabel <- ifelse(full_results$padj< 1e-40, as.character(full_results$hgnc_symbol), NA)
#lets produce a volcano plot like the one the paper
library(ggplot2)
library(ggrepel)
threshold_pval=0.1
vp <- ggplot(data=full_results, aes(x=log2FoldChange, y=-log10(pvalue), col=DE,label=hgnc_symbol, ensembl=Row.names)) +
geom_point(alpha = 0.5) +
theme_minimal() +
scale_color_manual(values=c("blue","black","red")) +
geom_vline(xintercept=c(-2, 2), linetype="dashed") +
geom_hline(yintercept=-log10(threshold_pval), linetype="dashed") +
geom_text_repel(aes(label = condlabel), show.legend =FALSE)
vp
## Warning: Removed 73 rows containing missing values (`geom_point()`).
## Warning: Removed 14433 rows containing missing values (`geom_text_repel()`).
## Warning: ggrepel: 30 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
I am satisfied with the result, most of the representative genes are shared and the global distribution of genes is similar. As a final step I performed a ORA with panther, the methods of the paper does not specify the tool used for the ORA but indicate the correction method, ontologies and statitical test. So I tried to recreate it and compare with the supplementary table 2 (please refer to the world file 1-s2.0-S0753332221003991-mmc1)
###LETS DO FUNCTIONAL ANNOTATION!
#we are going to perform a ORA with Panther database as in the paper
library(rbioapi)
#First lets retrieve the avaiable gene sets to use from panther.
#rba_panther_info(what = "datasets")
#the ones we are interested in are those used in the paper:
#1 GO:0003674 molecular_function 10.5281/zenodo.8436609
#2 GO:0008150 biological_process 10.5281/zenodo.8436609
degs_up <- subset(degs,degs$log2FoldChange>0)
degs_down <- subset(degs,degs$log2FoldChange<0)
#ORA for upregulated genes
panther_bp_up <- rba_panther_enrich(genes=rownames(degs_up),
organism = 9606, #homo sapiens
annot_dataset = "GO:0008150", #biological process
test_type = "FISHER",
correction = "FDR")
## Performing over-representation enrichment analysis of 664 input genes of organism 9606 against GO:0008150 datasets.
head(panther_bp_up$result, n=5)
## number_in_list fold_enrichment fdr expected number_in_reference
## 1 120 5.164138 7.791725e-43 23.23718 825
## 2 193 2.988297 4.550264e-40 64.58528 2293
## 3 147 3.784632 1.777467e-39 38.84130 1379
## 4 147 3.776416 1.689978e-39 38.92580 1382
## 5 149 3.709687 2.223300e-39 40.16511 1426
## pValue plus_minus term.id term.label
## 1 5.075051e-47 + GO:0034097 response to cytokine
## 2 5.927524e-44 + GO:0002376 immune system process
## 3 3.473198e-43 + GO:0051707 response to other organism
## 4 4.402992e-43 + GO:0043207 response to external biotic stimulus
## 5 7.240603e-43 + GO:0009607 response to biotic stimulus
panther_mf_up <- rba_panther_enrich(genes=rownames(degs_up),
organism = 9606, #homo sapiens
annot_dataset = "GO:0003674", #molecular function
test_type = "FISHER",
correction = "FDR")
## Performing over-representation enrichment analysis of 664 input genes of organism 9606 against GO:0003674 datasets.
head(panther_mf_up$result, n=5)
## number_in_list fold_enrichment fdr expected number_in_reference
## 1 47 7.070602 9.006945e-20 6.647242 236
## 2 66 4.621751 7.450525e-20 14.280303 507
## 3 66 4.558809 9.772765e-20 14.477467 514
## 4 49 6.395842 1.155147e-19 7.661228 272
## 5 26 19.231034 2.707238e-19 1.351981 48
## pValue plus_minus term.id term.label
## 1 1.775467e-23 + GO:0005125 cytokine activity
## 2 2.937325e-23 + GO:0048018 receptor ligand activity
## 3 5.779282e-23 + GO:0030546 signaling receptor activator activity
## 4 9.108198e-23 + GO:0005126 cytokine receptor binding
## 5 2.668281e-22 + GO:0008009 chemokine activity
#ORA FOR Downregulated genes
panther_bp_down <- rba_panther_enrich(genes=rownames(degs_down),
organism = 9606, #homo sapiens
annot_dataset = "GO:0008150", #biological process
test_type = "FISHER",
correction = "FDR")
## Performing over-representation enrichment analysis of 454 input genes of organism 9606 against GO:0008150 datasets.
head(panther_bp_down$result, n=5)
## number_in_list fold_enrichment fdr expected number_in_reference
## 1 290 1.242669 2.193481e-10 233.36869 14924
## 2 101 2.178409 1.329852e-10 46.36412 2965
## 3 194 1.514079 1.300629e-09 128.13073 8194
## 4 163 1.609620 7.560303e-09 101.26612 6476
## 5 141 1.706148 1.041847e-08 82.64229 5285
## pValue plus_minus term.id
## 1 1.428698e-14 + GO:0009987
## 2 1.732367e-14 + GO:0051239
## 3 2.541449e-13 + GO:0050896
## 4 1.969726e-12 + GO:0051716
## 5 3.392974e-12 + GO:0007154
## term.label
## 1 cellular process
## 2 regulation of multicellular organismal process
## 3 response to stimulus
## 4 cellular response to stimulus
## 5 cell communication
panther_mf_down <- rba_panther_enrich(genes=rownames(degs_down),
organism = 9606, #homo sapiens
annot_dataset = "GO:0003674", #molecular function
test_type = "FISHER",
correction = "FDR")
## Performing over-representation enrichment analysis of 454 input genes of organism 9606 against GO:0003674 datasets.
head(panther_mf_down$result, n=5)
## number_in_list fold_enrichment fdr expected number_in_reference
## 1 18 7.194410 1.667975e-06 2.501943 160
## 2 57 2.068767 5.618184e-04 27.552642 1762
## 3 51 2.137265 9.508244e-04 23.862277 1526
## 4 72 1.772977 2.313846e-03 40.609654 2597
## 5 310 1.083904 2.564994e-03 286.003302 18290
## pValue plus_minus term.id term.label
## 1 3.287946e-10 + GO:0005178 integrin binding
## 2 2.214935e-07 + GO:0044877 protein-containing complex binding
## 3 5.622853e-07 + GO:0005102 signaling receptor binding
## 4 1.824440e-06 + GO:0036094 small molecule binding
## 5 2.528084e-06 + GO:0003674 molecular_function
The terms are not identical as was expected but in general the trend is similar the upregulated genes are enriched in terms associated to inflammation and inmmune activity mediated by interferon, while the downregulated genes are more enriched in terms associated to metabolism and biochemical processes. In conclusion, the analysis perfomed can recall many of the findings described on the paper but was not posible a 100% of reproducibility of the analysis and data since in the paper some technical details were missing.
P. Vishnyakova, A. Poltavets, E. Karpulevich, A. Maznina, V. Vtorushina, L. Mikhaleva, E. Kananykhina, A. Lokhonina, S. Kovalchuk, A. Makarov, A. Elchaninov, G. Sukhikh, T. Fatkhudinov.The response of two polar monocyte subsets to inflammation.Biomedicine & Pharmacotherapy,Volume 139,2021,111614,ISSN 0753-3322.https://doi.org/10.1016/j.biopha.2021.111614.
Tarazona S, Furió-Tarí P, Turrà D, Pietro AD, Nueda MJ, Ferrer A, Conesa A. Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res. 2015 Dec 2;43(21):e140. doi: 10.1093/nar/gkv711. Epub 2015 Jul 16. PMID: 26184878; PMCID: PMC4666377.
Evans C, Hardin J, Stoebel DM. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief Bioinform. 2018 Sep 28;19(5):776-792. doi: 10.1093/bib/bbx008. PMID: 28334202; PMCID: PMC6171491.
4.Rau A, Gallopin M, Celeux G, Jaffrezic F (2013). “Data-based filtering for replicated high-throughput transcriptome sequencing experiments.” Bioinformatics, 29(17), 2146-2152.